{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "from pandas import DataFrame\n",
    "import sqlalchemy"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "We load the weather data below. The fields for this dataset are enumerated [here](ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/by_year/ghcn-daily-by_year-format.rtf)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "weather_data = pd.read_csv('2015.csv', \n",
    "                           header=None,\n",
    "                           index_col=False,\n",
    "                           names=['station_identifier', \n",
    "                                  'measurement_date', \n",
    "                                  'measurement_type', \n",
    "                                  'measurement_flag', \n",
    "                                  'quality_flag', \n",
    "                                  'source_flag', \n",
    "                                  'observation_time'],\n",
    "                           parse_dates=['measurement_date'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>station_identifier</th>\n",
       "      <th>measurement_date</th>\n",
       "      <th>measurement_type</th>\n",
       "      <th>measurement_flag</th>\n",
       "      <th>quality_flag</th>\n",
       "      <th>source_flag</th>\n",
       "      <th>observation_time</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>US1FLSL0019</td>\n",
       "      <td>2015-01-01</td>\n",
       "      <td>PRCP</td>\n",
       "      <td>173</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>N</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>US1TXTV0133</td>\n",
       "      <td>2015-01-01</td>\n",
       "      <td>PRCP</td>\n",
       "      <td>119</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>N</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>USC00178998</td>\n",
       "      <td>2015-01-01</td>\n",
       "      <td>TMAX</td>\n",
       "      <td>-33</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>7</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>USC00178998</td>\n",
       "      <td>2015-01-01</td>\n",
       "      <td>TMIN</td>\n",
       "      <td>-167</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>7</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>USC00178998</td>\n",
       "      <td>2015-01-01</td>\n",
       "      <td>TOBS</td>\n",
       "      <td>-67</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>7</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  station_identifier measurement_date measurement_type  measurement_flag  \\\n",
       "0        US1FLSL0019       2015-01-01             PRCP               173   \n",
       "1        US1TXTV0133       2015-01-01             PRCP               119   \n",
       "2        USC00178998       2015-01-01             TMAX               -33   \n",
       "3        USC00178998       2015-01-01             TMIN              -167   \n",
       "4        USC00178998       2015-01-01             TOBS               -67   \n",
       "\n",
       "  quality_flag source_flag observation_time  \n",
       "0          NaN         NaN                N  \n",
       "1          NaN         NaN                N  \n",
       "2          NaN         NaN                7  \n",
       "3          NaN         NaN                7  \n",
       "4          NaN         NaN                7  "
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "weather_data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are a [large number](ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/readme.txt) of categories of weather data in the dataset. For simplicity, we only want to load the data from the 5 \"core\" weather categories into the database for further analysis:\n",
    "\n",
    "* PRCP : Precipitation (tenths of mm)\n",
    "* SNOW : Snowfall (mm)\n",
    "* SNWD : Snow depth (mm)\n",
    "* TMAX : Maximum temperature (tenths of degrees C)\n",
    "* TMIN : Minimum temperature (tenths of degrees C)\n",
    "\n",
    "We also want to cull a few columns from the DataFrame."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "weather_data_subset = weather_data[weather_data.measurement_type.isin(['PRCP', 'SNOW', 'SNWD', 'TMAX', 'TMIN'])][['station_identifier', 'measurement_date', 'measurement_type', 'measurement_flag']]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>station_identifier</th>\n",
       "      <th>measurement_date</th>\n",
       "      <th>measurement_type</th>\n",
       "      <th>measurement_flag</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>US1FLSL0019</td>\n",
       "      <td>2015-01-01</td>\n",
       "      <td>PRCP</td>\n",
       "      <td>173</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>US1TXTV0133</td>\n",
       "      <td>2015-01-01</td>\n",
       "      <td>PRCP</td>\n",
       "      <td>119</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>USC00178998</td>\n",
       "      <td>2015-01-01</td>\n",
       "      <td>TMAX</td>\n",
       "      <td>-33</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>USC00178998</td>\n",
       "      <td>2015-01-01</td>\n",
       "      <td>TMIN</td>\n",
       "      <td>-167</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>USC00178998</td>\n",
       "      <td>2015-01-01</td>\n",
       "      <td>PRCP</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  station_identifier measurement_date measurement_type  measurement_flag\n",
       "0        US1FLSL0019       2015-01-01             PRCP               173\n",
       "1        US1TXTV0133       2015-01-01             PRCP               119\n",
       "2        USC00178998       2015-01-01             TMAX               -33\n",
       "3        USC00178998       2015-01-01             TMIN              -167\n",
       "5        USC00178998       2015-01-01             PRCP                 0"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "weather_data_subset.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This cuts down the total count of records by about 30%.\n",
    "\n",
    "Now, let's write the weather data to our DB. If you setup the Postgres DB as noted in the README, this should instantiate a connection to the database with your local unix username.\n",
    "\n",
    "If you've configured another Postgres user with a username / password, please fill in the appropriate credentials using the [SQL Alchemy connection string](http://docs.sqlalchemy.org/en/latest/core/engines.html)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "db_name = 'weather'\n",
    "connection_string = \"postgresql://localhost:5432/%s\" % (db_name)\n",
    "conn = sqlalchemy.create_engine(connection_string)\n",
    "\n",
    "table_name = 'weather_data'\n",
    "# The to_sql method defaults to bigint for integer types here, which are larger than we need. \n",
    "# This manually sets the datatypes of the columns we need to override\n",
    "column_type_dict = {'measurement_flag': sqlalchemy.types.Integer}\n",
    "# Writing all the data to the DB at once will cause this notebook to crash.\n",
    "# We pass a large integer to the chunksize parameter to chunk the writing of records\n",
    "weather_data_subset.to_sql(table_name, conn, chunksize=100000, index_label='id', dtype=column_type_dict)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "Now, let's process and read in the metadata - which contains the (lat, long) - tied to each weather station."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "station_metadata = pd.read_csv('ghcnd-stations.txt', \n",
    "                           sep='\\s+',  # Fields are separated by one or more spaces\n",
    "                           usecols=[0, 1, 2, 3],  # Grab only the first 4 columns\n",
    "                           na_values=[-999.9],  # Missing elevation is noted as -999.9\n",
    "                           header=None,\n",
    "                           names=['station_id', 'latitude', 'longitude', 'elevation'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>station_id</th>\n",
       "      <th>latitude</th>\n",
       "      <th>longitude</th>\n",
       "      <th>elevation</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>ACW00011604</td>\n",
       "      <td>17.1167</td>\n",
       "      <td>-61.7833</td>\n",
       "      <td>10.1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>ACW00011647</td>\n",
       "      <td>17.1333</td>\n",
       "      <td>-61.7833</td>\n",
       "      <td>19.2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>AE000041196</td>\n",
       "      <td>25.3330</td>\n",
       "      <td>55.5170</td>\n",
       "      <td>34.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>AEM00041194</td>\n",
       "      <td>25.2550</td>\n",
       "      <td>55.3640</td>\n",
       "      <td>10.4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>AEM00041217</td>\n",
       "      <td>24.4330</td>\n",
       "      <td>54.6510</td>\n",
       "      <td>26.8</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    station_id  latitude  longitude  elevation\n",
       "0  ACW00011604   17.1167   -61.7833       10.1\n",
       "1  ACW00011647   17.1333   -61.7833       19.2\n",
       "2  AE000041196   25.3330    55.5170       34.0\n",
       "3  AEM00041194   25.2550    55.3640       10.4\n",
       "4  AEM00041217   24.4330    54.6510       26.8"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "station_metadata.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "How many stations do we have with missing elevation?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "4623"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(station_metadata[station_metadata['elevation'].isnull()])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Write the metadata to the DB:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "metadata_table_name = 'station_metadata'\n",
    "station_metadata.to_sql(metadata_table_name, conn, index_label='id')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, we want to label the 5 types of weather data noted above with human-readable names. We create a dictionary mapping each label to its description, create a Data Frame from that, and write it to the database:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "weather_type_dict = {'PRCP': 'Precipitation', 'SNOW': 'Snowfall', 'SNWD': 'Snow Depth', \n",
    "                     'TMAX': 'Maximum temperature', 'TMIN': 'Minimum temperature'}\n",
    "weather_type_df = DataFrame(weather_type_dict.items(), columns=['weather_type', 'weather_description'])\n",
    "description_table_name = 'weather_types'\n",
    "weather_type_df.to_sql(description_table_name, conn, index_label='id')"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
